Take-home Exercise 2: Regionalisation of Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods

Author

Hulwana

1 Overview

The process of creating regions is called regionalisation. A regionalisation is a special kind of clustering where the objective is to group observations which are similar in their statistical attributes, but also in their spatial location. In this sense, regionalization embeds the same logic as standard clustering techniques, but also applies a series of geographical constraints. Often, these constraints relate to connectivity: two candidates can only be grouped together in the same region if there exists a path from one member to another member that never leaves the region. These paths often model the spatial relationships in the data, such as contiguity or proximity. However, connectivity does not always need to hold for all regions, and in certain contexts it makes sense to relax connectivity or to impose different types of geographic constraints.

1.1 Getting Started

1.1.1 Load Packages

For our analysis, we will utilize the following packages:

  • sf - for importing and processing geospatial data,

  • tidyverse - for importing and processing non-spatial data. In this exercise, readr package will be used for importing wkt data and dplyr package will be used to wrangling the data.

We will run the following code chunk to load the required packages:

pacman::p_load(sf, tidyverse, tmap, spdep, funModeling, ggpubr, corrplot, 
               heatmaply)

1.1.2 Import Data

1.1.2.1 Importing water point data

wp_nga <- read_csv("data/aspatial/WPdx.csv") %>%
  filter(`#clean_country_name` == "Nigeria")

Thing to learn from the code chunk above:

  • The original file name is called Water_Point_Data_Exchange_-_PlusWPdx.csv, it has been rename to WPdx.csv for easy encoding.

  • Instead of using read.csv() of Base R to import the csv file into R, read_csv() is readr package is used. This is because during the initial data exploration, we notice that there is at least one field name with space between the field name (ie. New Georeferenced Column)

  • The data file contains water point data of many countries. In this study, we are interested on water point in Nigeria on. Hence, filter() of dplyr is used to extract out records belong to Nigeria only.

Convert wkt data

After the data are imported into R environment, it is a good practice to review both the data structure and the data table if it is in tibble data frame format in R Studio.

Notice that the newly imported tibble data frame (i.e. wp_nga) contains a field called New Georeferenced Column which represent spatial data in a textual format. In fact, this kind of text file is popularly known as Well Known Text in short wkt.

Two steps will be used to convert an asptial data file in wkt format into a sf data frame by using sf.

First, st_as_sfc() of sf package is used to derive a new field called Geometry as shown in the code chunk below.

wp_nga$Geometry <- st_as_sfc(wp_nga$`New Georeferenced Column`)

Next, st_sf() will be used to convert the tibble data frame into sf data frame.

wp_nga <- st_sf(wp_nga, crs=4326) %>% st_transform(crs = 26391)
wp_nga

When the process completed, a new sf data frame called wp_sf will be created.

1.1.2.2 Importing Nigeria LGA level boundary data

For the purpose of this exercise, shapefile downloaded from geoBoundaries portal will be used.

nga <- st_read(dsn = "data/geospatial",
               layer = "nga_admbnda_adm2_osgof_20190417",
               crs = 4326) %>%
  st_transform(crs = 26391) %>%
  select(3:4,8:9,17)
Reading layer `nga_admbnda_adm2_osgof_20190417' from data source 
  `C:\hulwana\ISSS624\Take-Home_Ex\Take-Home_Ex2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

1.2 Data Preparation

Before proceeding to the geospatial analysis, we will first prepare the data.

1.2.1 Checking duplicated area names

We will first check if there are any duplicated areas by running the following code chunk:

dup <- nga$ADM2_EN[duplicated(nga$ADM2_EN)]
dup
[1] "Bassa"    "Ifelodun" "Irepodun" "Nasarawa" "Obi"      "Surulere"

From the above, we see that areas Bassa, Ifelodun, Irepodun, Nasarawa, Obi and Surulere have duplicated labelling.

We will then plot the duplicated areas to determine to understand where are the areas with duplicated names

dup_areas <- nga %>%
  filter(ADM2_EN %in% dup) %>%
  select(ADM2_EN, geometry)

state_borders <- nga %>%
  select(ADM1_EN, geometry)

tmap_mode("view")

tm_shape(state_borders) +
  tm_fill("ADM1_EN") +
tm_shape(dup_areas) +
   tm_polygons("ADM2_EN", alpha = 0.08) +
  tm_layout(legend.show = FALSE)
tmap_mode("plot")

Upon searching on the web based on the information gathers in nigeriainfopedia, we realized that the duplication in names exist due to areas having similar names in different state. The states at which these areas are located are as follows:

dup_areas_state <- nga %>%
  filter(ADM2_EN %in% dup) %>%
  select(ADM2_EN, ADM1_EN)

dup_areas_state
Simple feature collection with 12 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 99926.41 ymin: 271934.4 xmax: 729231.7 ymax: 893313
Projected CRS: Minna / Nigeria West Belt
First 10 features:
    ADM2_EN  ADM1_EN                       geometry
1     Bassa     Kogi MULTIPOLYGON (((555599.8 44...
2     Bassa  Plateau MULTIPOLYGON (((704592.8 70...
3  Ifelodun    Kwara MULTIPOLYGON (((273735.2 55...
4  Ifelodun     Osun MULTIPOLYGON (((255291.7 43...
5  Irepodun    Kwara MULTIPOLYGON (((305947.5 46...
6  Irepodun     Osun MULTIPOLYGON (((235603 4279...
7  Nasarawa     Kano MULTIPOLYGON (((677128.6 89...
8  Nasarawa Nasarawa MULTIPOLYGON (((608258.1 51...
9       Obi    Benue MULTIPOLYGON (((663064.8 34...
10      Obi Nasarawa MULTIPOLYGON (((727739.2 48...

Since these areas have duplicated names, it might result in an inaccurate analysis, and therefore has to be recoded by executing the following code chunk:

nga$ADM2_EN[nga$ADM2_EN == "Bassa" & nga$ADM1_EN == "Kogi"] <- "Bassa (Kogi)"
nga$ADM2_EN[nga$ADM2_EN == "Bassa" & nga$ADM1_EN == "Plateau"] <- "Bassa (Plateau)"
nga$ADM2_EN[nga$ADM2_EN == "Ifelodun" & nga$ADM1_EN == "Kwara"] <- "Ifelodun (Kwara)"
nga$ADM2_EN[nga$ADM2_EN == "Ifelodun" & nga$ADM1_EN == "Osun"] <- "Ifelodun (Osun)"
nga$ADM2_EN[nga$ADM2_EN == "Irepodun" & nga$ADM1_EN == "Kwara"] <- "Irepodun (Kwara)"
nga$ADM2_EN[nga$ADM2_EN == "Irepodun" & nga$ADM1_EN == "Osun"] <- "Irepodun (Osun)"
nga$ADM2_EN[nga$ADM2_EN == "Nasarawa" & nga$ADM1_EN == "Kano"] <- "Nasarawa (Kano)"
nga$ADM2_EN[nga$ADM2_EN == "Nasarawa" & nga$ADM1_EN == "Nasarawa"] <- "Nasarawa (Nasarawa)"
nga$ADM2_EN[nga$ADM2_EN == "Obi" & nga$ADM1_EN == "Benue"] <- "Obi (Benue)"
nga$ADM2_EN[nga$ADM2_EN == "Obi" & nga$ADM1_EN == "Nasarawa"] <- "Obi (Nasarawa)"
nga$ADM2_EN[nga$ADM2_EN == "Surulere" & nga$ADM1_EN == "Lagos"] <- "Surulere (Lagos)"
nga$ADM2_EN[nga$ADM2_EN == "Surulere" & nga$ADM1_EN == "Oyo"] <- "Surulere (Oyo)"

Check if there are duplicated in LGA names after the clean-up

nga$ADM2_EN[duplicated(nga$ADM2_EN)]
character(0)

1.3 Data Wrangling

1.3.1 Extract all the required variables and recode if needed

Since we would like to understand if there are any relation ship on the number of functional and non-functional point, we would need to ensure that the variables required are cleaned. We will first load the data to see what are the fields present by using the glimpse() function.

glimpse(wp_nga)

In total there are 71 fields each having 95,008 observations. Thus, we will select only the required columns needed for our analysis.

The data required for our analysis are:

  • Total number of functional water points

  • Total number of nonfunctional water points

  • Percentage of functional water points

  • Percentage of non-functional water points

  • Percentage of main water point technology (i.e. Hand Pump)

  • Percentage of usage capacity (i.e. < 1000, >=1000)

  • Percentage of rural water points

Thus we will:

  1. Select the columns `#water_tech_category`, `#status_clean` and is_urban

  2. Additional columns selected: `#subjective_quality` and usage_capacity

  3. Tidy the name of variables that starts with “#”

wp_rev <- wp_nga %>%
  select(10,22,26,46,47) %>%
  rename(`water_tech` = `#water_tech_category`, `status_clean` = `#status_clean`,
         `quality` = `#subjective_quality` )

Since, we are interested to know how many functional and non-functional taps there are, we execute the following code to count the number of functional and non-functional taps as well as get the percentages of each type of taps.

freq(data=wp_rev, 
     input = 'status_clean')

                      status_clean frequency percentage cumulative_perc
1                       Functional     45883      48.29           48.29
2                   Non-Functional     29385      30.93           79.22
3                          Unknown     10656      11.22           90.44
4      Functional but needs repair      4579       4.82           95.26
5 Non-Functional due to dry season      2403       2.53           97.79
6        Functional but not in use      1686       1.77           99.56
7         Abandoned/Decommissioned       234       0.25           99.81
8                        Abandoned       175       0.18           99.99
9 Non functional due to dry season         7       0.01          100.00

1.3.1.1 Recoding NA values into String

We observed that there are more than 10% of observations that are NAs for this field. Thus, we will recode it into ‘Unknown’.

wp_rev <- wp_rev %>%
   dplyr::mutate(status_clean = 
           replace_na(status_clean, "Unknown"))

1.3.2 Extracting Water Point Data

We will re-group the water point categories into the following

  • Unknown

  • Functional

  • Non-functional

1.3.2.1 Extracting Water Point with Unknown Class

In the code chunk below, filter() of dplyr is used to select water points with unknown status.

wpt_unknown <- wp_rev %>%
  filter(status_clean == "Unknown")

1.3.2.2 Extracting Functional Water Point

In the code chunk below, filter() of dplyr is used to select functional water points.

We will consider the following categories as functional water points:

  • Functional

  • Functional but not in use

  • Functional but needs repair

wpt_functional <- wp_rev %>%
  filter(status_clean %in%
           c("Functional", 
             "Functional but not in use",
             "Functional but needs repair"))

1.3.2.3 Extracting Non-Functional Water Point

On the other hand, the following categories, will be grouped as non-functional water points:

  • Non-Functional

  • Non-Functional due to dry season

  • Abandoned/Decommissioned

  • Abandoned

  • Non functional due to dry season

wpt_nonfunctional <- wp_rev %>%
  filter(status_clean %in%
           c("Non-Functional",
             "Non-Functional due to dry season",
             "Abandoned/Decommissioned",
             "Abandoned",
             "Non functional due to dry season"))

1.3.2.4 Performing Point-in-Polygon Count

To count the number of different categories of water points found by LGA, we will utilize the mutate() function for the calculation as shown in the code:

nga_wp <- nga %>% 
  mutate(`total wpt` = lengths(
    st_intersects(nga, wp_rev))) %>%
  mutate(`wpt functional` = lengths(
    st_intersects(nga, wpt_functional))) %>%
  mutate(`wpt non-functional` = lengths(
    st_intersects(nga, wpt_nonfunctional))) %>%
  mutate(`wpt unknown` = lengths(
    st_intersects(nga, wpt_unknown)))

1.3.2.5 Compute the Percentages of Water Points

To compute the percentages of functional and non-functional water points, we execute the following code

nga_wp <- nga_wp %>%
  mutate(pct_functional = `wpt functional`/`total wpt`) %>%
  mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`)

1.3.3 Extracting Water Technology Data

To see what are the different types of water technology present as well as its distribution, we run the following code:

freq(data=wp_rev, 
     input = 'water_tech')

       water_tech frequency percentage cumulative_perc
1       Hand Pump     58755      61.84           61.84
2 Mechanized Pump     25644      26.99           88.83
3            <NA>     10055      10.58           99.41
4        Tapstand       553       0.58           99.99
5 Rope and Bucket         1       0.00          100.00

Observed that the dominating type of water technology belongs to the ‘Hand Pump’ category at 61.84% of the water point found in Nigeria. As the number of ‘Mechanized Pump’ is substantially large we will also consider the percentage of this type of water point technology in our analysis. The number of water points that are either ‘Tapstand’ and ‘Rope and Bucket’ is too small and thus will not be considered as a variable in our analysis.

1.3.3.1 Extracting Hand Pump Water Points

wpt_hand <- wp_rev %>%
  filter(water_tech == "Hand Pump")

1.3.3.2 Extracting Mechanized Pump Water Points

wpt_mechanized <- wp_rev %>%
  filter(water_tech == "Mechanized Pump")

1.3.3.3 Performing Point-in-Polygon Count

To count the number of different categories of water point techinlogies found in each LGA, we will utilize the mutate() function for the calculation as shown in the code:

nga_wp <- nga_wp %>% 
  mutate(`wpt hand` = lengths(
    st_intersects(nga_wp, wpt_hand))) %>%
  mutate(`wpt mechanized` = lengths(
    st_intersects(nga_wp, wpt_mechanized)))

1.3.3.4 Compute the Percentages of Hand Pump and Mechanized Pump Water Points

To compute the percentages of functional and non-functional water points, we execute the following code:

nga_wp <- nga_wp %>%
  mutate(pct_hand = `wpt hand`/`total wpt`) %>%
  mutate(`pct_mechanized` = `wpt mechanized`/`total wpt`)

1.3.4 Extracting Rural and Urban Areas

1.3.4.1 Extract data on rural areas

wpt_rural <- wp_rev %>%
  filter(is_urban == FALSE)

1.3.4.2 Performing Point-in-Polygon Count

nga_wp <- nga_wp %>% 
  mutate(`wpt rural` = lengths(
    st_intersects(nga_wp, wpt_rural)))

1.3.4.3 Compute the Percentages of Rural Areas

nga_wp <- nga_wp %>%
  mutate(pct_rural = `wpt rural`/`total wpt`)

1.3.5 Extracting Quality

1.3.5.1 Different Categories for Quality

freq(data=wp_rev, 
     input = 'quality')

                              quality frequency percentage cumulative_perc
1                  Acceptable quality     71801      75.57           75.57
2                                <NA>     10625      11.18           86.75
3                 No because of Taste      6712       7.06           93.81
4                No because of Colour      3986       4.20           98.01
5                 No because of Odour      1847       1.94           99.95
6    Within National limits (potable)        19       0.02           99.97
7 Within National standards (potable)        18       0.02          100.00

1.3.5.2 Acceptable Quality

wpt_acceptable <- wp_rev %>%
  filter(quality %in%
           c("Acceptable quality", 
             "Within National standards (potable)",
             "Within National limits (potable)"))

1.3.5.3 Unacceptable Quality

wpt_unacceptable <- wp_rev %>%
  filter(quality %in%
           c("No because of Taste", 
             "No because of Colour",
             "No because of Odour"))

1.3.5.4 Performing Point-in-Polygon Count

nga_wp <- nga_wp %>% 
  mutate(`wpt acceptable` = lengths(
    st_intersects(nga_wp, wpt_acceptable))) %>%
  mutate(`wpt unacceptable` = lengths(
    st_intersects(nga_wp, wpt_unacceptable)))

1.3.5.5 Compute the Percentages of Acceptable and Unacceptable Water Quality

nga_wp <- nga_wp %>%
  mutate(pct_acceptable = `wpt acceptable`/`total wpt`) %>%
  mutate(pct_unacceptable = `wpt unacceptable`/`total wpt`)

1.3.6 Usage Capacity

1.3.6.1 Extracting Usage Capacity Data

freq(data=wp_rev, 
     input = 'usage_capacity')

  usage_capacity frequency percentage cumulative_perc
1            300     68789      72.40           72.40
2           1000     25644      26.99           99.39
3            250       573       0.60           99.99
4             50         2       0.00          100.00

We see that there are 2 groups with stubstantial number of observations which are 300 and 1000. Thus, we will recode the groups to those with less than or equal to 300 and more than 300.

1.3.6.2 Usage capacity of 300 or lesser

wpt_less300 <- wp_rev %>%
  filter(usage_capacity < 301)

1.3.6.3 Usage capacity greater than 300

wpt_more300 <- wp_rev %>%
  filter(usage_capacity >= 1000)

1.3.6.4 Performing Point-in-Polygon Count

nga_wp <- nga_wp %>% 
  mutate(`wpt less300` = lengths(
    st_intersects(nga_wp, wpt_less300))) %>%
  mutate(`wpt more300` = lengths(
    st_intersects(nga_wp, wpt_more300)))

1.3.6.5 Compute the Percentages of Usage Capacity less than or greater than 301

nga_wp <- nga_wp %>%
  mutate(pct_less300 = `wpt less300`/`total wpt`) %>%
  mutate(pct_more300 = `wpt more300`/`total wpt`)

1.3.7 Saving Data

We will then save the cleaned data in rds format.

2 Exploratory Data Analysis

2.1 Summary Statistics

Let us review the summary statistics of the newly derived penetration rates using the code chunk below.

summary(nga_wp)
   ADM2_EN           ADM2_PCODE          ADM1_EN           ADM1_PCODE       
 Length:774         Length:774         Length:774         Length:774        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
          geometry     total wpt     wpt functional   wpt non-functional
 MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00    
 epsg:26391   :  0   1st Qu.: 45.0   1st Qu.: 17.00   1st Qu.: 12.25    
 +proj=tmer...:  0   Median : 96.0   Median : 45.50   Median : 34.00    
                     Mean   :122.7   Mean   : 67.36   Mean   : 41.60    
                     3rd Qu.:168.8   3rd Qu.: 87.75   3rd Qu.: 60.75    
                     Max.   :894.0   Max.   :752.00   Max.   :278.00    
                                                                        
  wpt unknown     pct_functional   pct_non-functional    wpt hand     
 Min.   :  0.00   Min.   :0.0000   Min.   :0.0000     Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.:0.3333   1st Qu.:0.2211     1st Qu.:  6.00  
 Median :  0.00   Median :0.4792   Median :0.3559     Median : 47.00  
 Mean   : 13.76   Mean   :0.5070   Mean   :0.3654     Mean   : 75.89  
 3rd Qu.: 17.75   3rd Qu.:0.6749   3rd Qu.:0.5082     3rd Qu.:111.00  
 Max.   :219.00   Max.   :1.0000   Max.   :1.0000     Max.   :764.00  
                  NA's   :13       NA's   :13                         
 wpt mechanized      pct_hand      pct_mechanized     wpt rural     
 Min.   :  0.00   Min.   :0.0000   Min.   :0.0000   Min.   :  0.00  
 1st Qu.: 11.00   1st Qu.:0.1860   1st Qu.:0.1250   1st Qu.: 23.00  
 Median : 25.50   Median :0.5255   Median :0.3193   Median : 64.00  
 Mean   : 33.12   Mean   :0.4956   Mean   :0.3818   Mean   : 97.45  
 3rd Qu.: 46.00   3rd Qu.:0.7857   3rd Qu.:0.5843   3rd Qu.:141.00  
 Max.   :245.00   Max.   :1.0000   Max.   :1.0000   Max.   :894.00  
                  NA's   :13       NA's   :13                       
   pct_rural      wpt acceptable   wpt unacceptable pct_acceptable  
 Min.   :0.0000   Min.   :  0.00   Min.   :  0.0    Min.   :0.0000  
 1st Qu.:0.5922   1st Qu.: 28.00   1st Qu.:  3.0    1st Qu.:0.5595  
 Median :0.8717   Median : 71.00   Median :  9.0    Median :0.7759  
 Mean   :0.7395   Mean   : 92.79   Mean   : 16.2    Mean   :0.7172  
 3rd Qu.:1.0000   3rd Qu.:127.00   3rd Qu.: 21.0    3rd Qu.:0.9221  
 Max.   :1.0000   Max.   :833.00   Max.   :225.0    Max.   :1.0000  
 NA's   :13                                         NA's   :13      
 pct_unacceptable   wpt less300      wpt more300      pct_less300    
 Min.   :0.00000   Min.   :  0.00   Min.   :  0.00   Min.   :0.0000  
 1st Qu.:0.04124   1st Qu.: 16.00   1st Qu.: 11.00   1st Qu.:0.4157  
 Median :0.10526   Median : 60.00   Median : 25.50   Median :0.6807  
 Mean   :0.15565   Mean   : 89.59   Mean   : 33.12   Mean   :0.6182  
 3rd Qu.:0.21429   3rd Qu.:127.75   3rd Qu.: 46.00   3rd Qu.:0.8750  
 Max.   :1.00000   Max.   :767.00   Max.   :245.00   Max.   :1.0000  
 NA's   :13                                          NA's   :13      
  pct_more300    
 Min.   :0.0000  
 1st Qu.:0.1250  
 Median :0.3193  
 Mean   :0.3818  
 3rd Qu.:0.5843  
 Max.   :1.0000  
 NA's   :13      

2.2 EDA using Histogram

Here, we take a look at the distribution of the percentages variable

functional <- ggplot(data=nga_wp, 
             aes(x= `pct_functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

nonfunctional <- ggplot(data=nga_wp, 
             aes(x= `pct_non-functional`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hand <- ggplot(data=nga_wp, 
             aes(x= `pct_hand`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

mechanized <- ggplot(data=nga_wp, 
             aes(x= `pct_mechanized`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

rural <- ggplot(data=nga_wp, 
             aes(x= `pct_rural`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

acceptable <- ggplot(data=nga_wp, 
             aes(x= `pct_acceptable`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

unacceptable <- ggplot(data=nga_wp, 
             aes(x= `pct_unacceptable`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

less300 <- ggplot(data=nga_wp, 
             aes(x= `pct_less300`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

more300 <- ggplot(data=nga_wp, 
             aes(x= `pct_more300`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(functional, nonfunctional, hand, mechanized, acceptable, unacceptable,
          less300, more300, rural,
          ncol = 3, 
          nrow = 3)

Notice that the variables pct_acceptable, pct_cap300 and pct_rural are skewed to the left. Whereas, variables pct_mechanized, pct_unacceptable and pct_cap1000 are skewed to the right.

2.3 EDA using Chloropleth Map

2.3.1 Total Waterpoints

qtm(nga_wp, "total wpt")

From the above map, we notice that there are a number of areas in the north-eastern part of Nigeria in which there 0 data on the number of water points.

2.3.2 Distribution of Functional and Non-functional Water Points by Percentages

tmap_mode("view")

pct_functional <- tm_shape(nga_wp) + 
  tm_fill("pct_functional") + 
  tm_borders() 

pct_nonfunctional <- tm_shape(nga_wp) + 
  tm_fill("pct_non-functional") + 
  tm_borders() 

tmap_arrange(pct_functional, pct_nonfunctional,
             asp = 1, ncol = 2,
             sync = TRUE)
tmap_mode("plot")

In terms of functional water points, areas in the northern region have high percentages in comparison to other parts in Nigeria, while in terms of non-functional water points, areas in southern part of Nigeria have higher percentages.

2.3.3 Distribution of Hand Pump and Mechanized Pump

tmap_mode("view")

pct_hand <- tm_shape(nga_wp) + 
  tm_fill("pct_hand", palette = "BuGn") + 
  tm_borders() 

pct_mechanized <- tm_shape(nga_wp) + 
  tm_fill("pct_mechanized", palette = "BuGn") + 
  tm_borders() 

tmap_arrange(pct_hand, pct_mechanized,
             asp = 1, ncol = 2,
             sync = TRUE)
tmap_mode("plot")

We see a similar to that of the distribution of functional and non-functional water points, where the norther regions of Nigeria tend to have higher percentages of hand pump but lower percentage of mechanized pump whereas southern regions of Nigeria tended to have higher percentages of mechanized pumps but lower percentages of hand pumps. Perhaps this could be attributed to the the technology and development of the different regions and therefore will proceed to look if there is a similar trend in terms of rural and urban areas.

2.3.4 Distribution of Rural Areas by Percentages

tmap_mode("view")

tm_shape(nga_wp) + 
  tm_fill("pct_rural", palette = "YlGn") + 
  tm_borders(col = "black", lwd = 2)
tmap_mode("plot")

Surprisingly, in terms of percentage of rural areas, it is almost homogeneous throughout Nigeria.

2.3.5 Distribution of Water Quality by Percentages

tmap_mode("view")

pct_accept <- tm_shape(nga_wp) + 
  tm_fill("pct_acceptable", palette = "Blues") + 
  tm_borders()

pct_unaccept <- tm_shape(nga_wp) + 
  tm_fill("pct_unacceptable", palette = "Blues") + 
  tm_borders()

tmap_arrange(pct_accept, pct_unaccept,
             asp = 1, ncol = 2,
             sync = TRUE)
tmap_mode("plot")

Based on the percentages of acceptable water quality, we see that the the distribution is also quite homogeneous among the areas in Nigeria with majority having high percentage of acceptable water quality.

2.3.6 Distribution of Usage Capacity

tmap_mode("view")

pct_less300 <- tm_shape(nga_wp) + 
  tm_fill("pct_less300", palette = "BuPu") + 
  tm_borders()

pct_more300 <- tm_shape(nga_wp) + 
  tm_fill("pct_more300", palette = "BuPu") + 
  tm_borders()

tmap_arrange(pct_less300, pct_more300,
             asp = 1, ncol = 2,
             sync = TRUE)
tmap_mode("plot")

We observe that the high percentages of usage capacity that is equal or greater than 1000 is prevalent in the southern part of Nigeria and some areas towards the extreme north-western region of Nigeria.

2.3.7 Further Data Preparation

From the above EDA, we notice that there are a number of areas in the north eastern part which seems to have missing data. This results in NAs when calculating the percentages and thus should be replaced to 0.

nga_wp <- nga_wp %>%
  replace(is.na(.), 0)

2.4 Multicollinearity

2.4.1 Correlation Analysis

Before we perform cluster analysis, it is important for us to ensure that the cluster variables are not highly correlated.

In this section, you will learn how to use corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.

nga_wp2 <- nga_wp %>%
  select(7,8,10,11,14,15,17,20,21,24,25) %>%
  st_drop_geometry()

cluster_vars.cor = cor(nga_wp2)
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Based on the correlation value computed, the following variables have high correlation of 0.8 and above:

  • pct_hand and pct_cap300

  • pct_mechanized and pct_cap300

  • pct_mechanized and pct_cap1000

  • pct_cap300 and pct_cap1000

Since these variables are highly collinear to each other, we should remove one from each pair of correlated variables.

However as we are unsure on which variables to remove, we can calculate the VIF values to conclude which are the variables to be excluded.

3 Clustering Analysis

In this section, we will perform hierarchical clustering.

3.1 Additional Data Preparation

Before we even proceed with the clustering, we need to further prepare the data.

3.1.1 Extracting clustering variables

The code chunk below will be used to extract the clustering variables from the nga_wp2 simple feature object into data.frame.

As hierarchical clustering does not consider geospatial data, thus it will be excluded by the code st_set_geometry().

nga_clust <- nga_wp %>%
  select(1,7,8,10,11,14,15,17,20,21) %>%
  st_drop_geometry()

Notice that the final clustering variables list does not include variable pct_cap300 and pct_cap1000 because it is highly correlated with the variables pct_hand and pct _mechanized respectively.

Next, we need to change the rows by LGA instead of row number by using the code chunk below:

row.names(nga_clust) <- nga_clust$"ADM2_EN"
head(nga_clust,10)
                      ADM2_EN wpt functional wpt non-functional pct_functional
Aba North           Aba North              7                  9      0.4117647
Aba South           Aba South             29                 35      0.4084507
Abadam                 Abadam              0                  0      0.0000000
Abaji                   Abaji             23                 34      0.4035088
Abak                     Abak             23                 25      0.4791667
Abakaliki           Abakaliki             82                 42      0.3519313
Abeokuta North Abeokuta North             16                 15      0.4705882
Abeokuta South Abeokuta South             72                 33      0.6050420
Abi                       Abi             79                 62      0.5197368
Aboh-Mbaise       Aboh-Mbaise             18                 26      0.2727273
               pct_non-functional   pct_hand pct_mechanized  pct_rural
Aba North               0.5294118 0.11764706      0.8235294 0.00000000
Aba South               0.4929577 0.09859155      0.8732394 0.05633803
Abadam                  0.0000000 0.00000000      0.0000000 0.00000000
Abaji                   0.5964912 0.40350877      0.5964912 0.84210526
Abak                    0.5208333 0.08333333      0.9166667 0.83333333
Abakaliki               0.1802575 0.43776824      0.0944206 0.87553648
Abeokuta North          0.4411765 0.14705882      0.7647059 0.20588235
Abeokuta South          0.2773109 0.16806723      0.7058824 0.00000000
Abi                     0.4078947 0.59868421      0.3289474 0.95394737
Aboh-Mbaise             0.3939394 0.01515152      0.6515152 0.72727273
               pct_acceptable pct_unacceptable
Aba North           0.7647059       0.17647059
Aba South           0.8028169       0.09859155
Abadam              0.0000000       0.00000000
Abaji               0.9824561       0.01754386
Abak                0.7291667       0.27083333
Abakaliki           0.4206009       0.11158798
Abeokuta North      0.7352941       0.17647059
Abeokuta South      0.8655462       0.01680672
Abi                 0.5657895       0.36184211
Aboh-Mbaise         0.5303030       0.13636364

Notice that the row number has been replaced into the LGA name. This is because hierarchical clustering does not need the LGA name. However, since we need to reference back to the LGA when we deduce the insights, thus we retain it as the rownames/ object id.

Now, we will delete the ADM2_EN field by using the code chunk below.

nga_clust <- select(nga_clust, c(2:10))
head(nga_clust, 10)
               wpt functional wpt non-functional pct_functional
Aba North                   7                  9      0.4117647
Aba South                  29                 35      0.4084507
Abadam                      0                  0      0.0000000
Abaji                      23                 34      0.4035088
Abak                       23                 25      0.4791667
Abakaliki                  82                 42      0.3519313
Abeokuta North             16                 15      0.4705882
Abeokuta South             72                 33      0.6050420
Abi                        79                 62      0.5197368
Aboh-Mbaise                18                 26      0.2727273
               pct_non-functional   pct_hand pct_mechanized  pct_rural
Aba North               0.5294118 0.11764706      0.8235294 0.00000000
Aba South               0.4929577 0.09859155      0.8732394 0.05633803
Abadam                  0.0000000 0.00000000      0.0000000 0.00000000
Abaji                   0.5964912 0.40350877      0.5964912 0.84210526
Abak                    0.5208333 0.08333333      0.9166667 0.83333333
Abakaliki               0.1802575 0.43776824      0.0944206 0.87553648
Abeokuta North          0.4411765 0.14705882      0.7647059 0.20588235
Abeokuta South          0.2773109 0.16806723      0.7058824 0.00000000
Abi                     0.4078947 0.59868421      0.3289474 0.95394737
Aboh-Mbaise             0.3939394 0.01515152      0.6515152 0.72727273
               pct_acceptable pct_unacceptable
Aba North           0.7647059       0.17647059
Aba South           0.8028169       0.09859155
Abadam              0.0000000       0.00000000
Abaji               0.9824561       0.01754386
Abak                0.7291667       0.27083333
Abakaliki           0.4206009       0.11158798
Abeokuta North      0.7352941       0.17647059
Abeokuta South      0.8655462       0.01680672
Abi                 0.5657895       0.36184211
Aboh-Mbaise         0.5303030       0.13636364

3.2 Data Standardization

In general, multiple variables will be used in cluster analysis. It is not unusual their values range are different. In order to avoid the cluster analysis result is baised to clustering variables with large values, it is useful to standardise the input variables before performing cluster analysis.

3.2.1 Min-Max Standardization

In the code chunk below, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method. The summary() is then used to display the summary statistics of the standardized clustering variables.

nga_clust.std <- normalize(nga_clust)
summary(nga_clust.std)
 wpt functional    wpt non-functional pct_functional   pct_non-functional
 Min.   :0.00000   Min.   :0.00000    Min.   :0.0000   Min.   :0.0000    
 1st Qu.:0.02261   1st Qu.:0.04406    1st Qu.:0.3261   1st Qu.:0.2105    
 Median :0.06051   Median :0.12230    Median :0.4741   Median :0.3505    
 Mean   :0.08957   Mean   :0.14962    Mean   :0.4984   Mean   :0.3592    
 3rd Qu.:0.11669   3rd Qu.:0.21853    3rd Qu.:0.6699   3rd Qu.:0.5076    
 Max.   :1.00000   Max.   :1.00000    Max.   :1.0000   Max.   :1.0000    
    pct_hand      pct_mechanized     pct_rural      pct_acceptable  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.1670   1st Qu.:0.1220   1st Qu.:0.5727   1st Qu.:0.5425  
 Median :0.5099   Median :0.3127   Median :0.8645   Median :0.7706  
 Mean   :0.4873   Mean   :0.3754   Mean   :0.7271   Mean   :0.7052  
 3rd Qu.:0.7778   3rd Qu.:0.5771   3rd Qu.:1.0000   3rd Qu.:0.9209  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 pct_unacceptable 
 Min.   :0.00000  
 1st Qu.:0.03748  
 Median :0.10000  
 Mean   :0.15303  
 3rd Qu.:0.20843  
 Max.   :1.00000  

Notice that the values range of the Min-max standardized clustering variables are 0-1 now.

3.2.2 Z-SCORE STANDARDIZATION

Z-score standardization can be performed easily by using scale() of Base R. The code chunk below will be used to stadardization the clustering variables by using Z-score method.

nga_clust.z <- scale(nga_clust)
describe(nga_clust.z)
nga_clust.z 

 9  Variables      774  Observations
--------------------------------------------------------------------------------
wpt functional 
         n    missing   distinct       Info       Mean        Gmd        .05 
       774          0        198          1 -5.251e-17     0.8924    -0.8068 
       .10        .25        .50        .75        .90        .95 
   -0.7575    -0.6217    -0.2698     0.2517     0.9214     1.8516 

lowest : -0.8315384 -0.8191936 -0.8068487 -0.7945039 -0.7821590
highest:  5.3408830  5.5877798  6.4642637  7.1308852  8.4517833
--------------------------------------------------------------------------------
wpt non-functional 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       140         1 3.821e-18     1.063   -1.0940   -1.0671 
      .25       .50       .75       .90       .95 
  -0.7908   -0.2047    0.5162    1.3584    1.8798 

lowest : -1.120970 -1.094021 -1.067072 -1.040122 -1.013173
highest:  3.487351  3.649046  3.810742  4.700067  6.370920
--------------------------------------------------------------------------------
pct_functional 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       619         1 7.561e-17     1.141   -1.5425   -1.2147 
      .25       .50       .75       .90       .95 
  -0.7119   -0.1005    0.7083    1.4846    1.7279 

lowest : -2.058859 -1.987641 -1.933688 -1.829379 -1.783483
highest:  2.018130  2.031671  2.034561  2.054983  2.071774
--------------------------------------------------------------------------------
pct_non-functional 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       618         1 8.919e-17     1.139  -1.67369  -1.38449 
      .25       .50       .75       .90       .95 
 -0.70720  -0.04174   0.70553   1.35134   1.75002 

lowest : -1.708296 -1.688966 -1.665457 -1.662129 -1.646541
highest:  2.347601  2.367581  2.404301  2.470507  3.046894
--------------------------------------------------------------------------------
pct_hand 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       619     0.999 5.741e-17     1.151  -1.49121  -1.43909 
      .25       .50       .75       .90       .95 
 -0.98007   0.06924   0.88910   1.33976   1.44345 

lowest : -1.491211 -1.463640 -1.462064 -1.451465 -1.450405
highest:  1.538273  1.541677  1.552523  1.552673  1.569186
--------------------------------------------------------------------------------
pct_mechanized 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       632         1 8.855e-17     1.126   -1.2311   -1.1146 
      .25       .50       .75       .90       .95 
  -0.8670   -0.2145    0.6902    1.6241    1.8970 

lowest : -1.284605 -1.266140 -1.265972 -1.253843 -1.253728
highest:  2.074185  2.085707  2.089359  2.106728  2.137558
--------------------------------------------------------------------------------
pct_rural 
         n    missing   distinct       Info       Mean        Gmd        .05 
       774          0        447      0.976 -3.548e-17      1.043    -2.2280 
       .10        .25        .50        .75        .90        .95 
   -1.8822    -0.4730     0.4211     0.8363     0.8363     0.8363 

lowest : -2.2279624 -2.2025325 -2.1993241 -2.1963717 -2.1854027
highest:  0.8147612  0.8171289  0.8178811  0.8245095  0.8363408
--------------------------------------------------------------------------------
pct_acceptable 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       607         1 1.538e-16     1.089   -2.1488   -1.4346 
      .25       .50       .75       .90       .95 
  -0.6317    0.2539    0.8379    1.0439    1.1268 

lowest : -2.738566 -2.621391 -2.534171 -2.427886 -2.385521
highest:  1.129149  1.130865  1.131213  1.135966  1.144935
--------------------------------------------------------------------------------
pct_unacceptable 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      774         0       564     0.999 2.878e-18    0.9763   -0.9023   -0.9023 
      .25       .50       .75       .90       .95 
  -0.6813   -0.3127    0.3266    1.2025    2.0457 

lowest : -0.9022539 -0.8886376 -0.8814204 -0.8808921 -0.8782870
highest:  4.4230402  4.4576200  4.5219385  4.8157151  4.9936074
--------------------------------------------------------------------------------

Notice the mean and standard deviation of the Z-score standardised clustering variables are 0 and 1 respectively.

Note: describe() of psych package is used here instead of summary() of Base R because the earlier provides standard deviation.

Warning: Z-score standardisation method should only be used if we would assume all variables come from some normal distribution.

3.2.3 Visualizing the Standardized Clustering Variables

Beside reviewing the summary statistics of the standardized clustering variables, it is also a good practice to visualize their distribution graphical.

The code chunk below plot the scaled pct_hand field.

r <- ggplot(data=nga_wp, 
             aes(x= `pct_mechanized`)) +
  geom_histogram(bins=40, 
                 color="black", 
                 fill="light blue")

nga_clust.std_df <- as.data.frame(nga_clust.std)
s <- ggplot(data=nga_clust.std_df, 
       aes(x=`pct_mechanized`)) +
  geom_histogram(bins=40, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

nga_clust.z_df <- as.data.frame(nga_clust.z)
z <- ggplot(data=nga_clust.z_df, 
       aes(x=`pct_mechanized`)) +
  geom_histogram(bins=40, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Z-score Standardisation")

ggarrange(r, s, z,
          ncol = 1,
          nrow = 3)